Skip to content

Improve child process termination on POSIX & Windows #1078

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 8, 2025

Conversation

felixweinberger
Copy link
Contributor

@felixweinberger felixweinberger commented Jul 3, 2025

Motivation and Context

#729, and #850 address important failures to shut down childprocesses when MCP servers spawn subprocesses.

This is not a trivial thing to fix, as there's no great cross-platform library for terminating process-trees on all platforms so we have to fork between Unix (using os.killpg) and Windows (using JobObjects via pywin32 which is the best supported external library for this: https://github.com/mhammond/pywin32)

This PR adds regressions tests to clearly demonstrate the edge cases addressed + unifies the approaches from the PRs listed above.

How Has This Been Tested?

New regression tests added (each fails without the relevant fix). These were developed and tested on both POSIX and Windows systems.

For #729 and #850 (properly terminate processes and their children across platforms):

Breaking Changes

None.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

@felixweinberger felixweinberger changed the base branch from main to fweinberger/stream-cleanup-only-approach July 3, 2025 11:31
@felixweinberger felixweinberger changed the title Improve child process termination on POSIX & Windows [WIP] Improve child process termination on POSIX & Windows Jul 3, 2025
@felixweinberger felixweinberger force-pushed the fweinberger/stream-cleanup-only-approach branch 10 times, most recently from 0205aa0 to 2be6f09 Compare July 4, 2025 17:13
@felixweinberger felixweinberger force-pushed the fweinberger/child-process-termination branch 7 times, most recently from 80f8217 to 12114f7 Compare July 4, 2025 19:36
@felixweinberger felixweinberger marked this pull request as ready for review July 4, 2025 19:43
@felixweinberger felixweinberger requested a review from dsp-ant July 4, 2025 19:43
@felixweinberger felixweinberger changed the title [WIP] Improve child process termination on POSIX & Windows Improve child process termination on POSIX & Windows Jul 4, 2025
@felixweinberger
Copy link
Contributor Author

@dsp-ant as discussed platform native ways to terminate processes for Unix & Windows without race conditions between child enumeration and parent shutdown.

@jingx8885 I had to break this out of #1044 to isolate this particular problem. If you'd be willing to test whether this branch still resolves your issue like you did on #1044 that would be greatly appreciated! The tests still pass, but just to make sure there's nothing we missed.

@felixweinberger felixweinberger force-pushed the fweinberger/child-process-termination branch 4 times, most recently from 1a561c6 to f0633a3 Compare July 4, 2025 20:05
@jingx8885
Copy link
Contributor

@felixweinberger got it, I'll deal with it in a few days.

@felixweinberger felixweinberger force-pushed the fweinberger/stream-cleanup-only-approach branch from 7af9e65 to c9b43bc Compare July 7, 2025 13:59
@felixweinberger felixweinberger force-pushed the fweinberger/child-process-termination branch from f0633a3 to 54be4d6 Compare July 7, 2025 14:01
@felixweinberger felixweinberger requested a review from Kludex July 7, 2025 14:17
@felixweinberger
Copy link
Contributor Author

felixweinberger commented Jul 7, 2025

@Kludex @agronholm given I've seen you both on Python process management related PRs, tagging y'all here in case you have any thoughts on whether this could or should be simpler with anyio 🙂

This is an edge case where terminating a process doesn't automatically terminate children via anyio's terminate(), but I'm also not sure whether it should in the first place as Python's asyncio doesn't do so either AFAIK.

For context, we need the subprocesses to be killed in the stdio case of MCP, because the MCP "server" is started as a subprocess of the "client". When spinning down the client, it should also spin down that server to free up resources.

@@ -212,3 +215,306 @@ def sigint_handler(signum, frame):
)
else:
raise


class TestChildProcessCleanup:
Copy link
Contributor Author

@felixweinberger felixweinberger Jul 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aware that using "timeout" anywhere in a test is usually not great practice - however, cross platform challenges and the subject matter being tested (processes) make this a little trickier to resolve than usual, as these tests are fundamentally there to detect whether processes are being closed (which in these tests should take a negligible, but non-zero amount of time).

I ran a deep research on alternative approaches - here's @claude 's POV:

⏺ Deep Analysis: Is the Time-Based Testing Approach Justified?

After researching testing best practices and ultrathinking about this specific case, yes, the approach is justified. Here's why:

The Tests Are Actually Condition-Based, Not Time-Based

The tests check a condition: "Is the file size stable?" This indicates process termination. The sleep() calls create measurement windows, not synchronization points. This is fundamentally
different from hoping timing works out.

Why Alternative Approaches Don't Work Here

  1. Process polling (poll()/wait()): Only works for known process handles. This PR tests termination of entire process trees including unknown grandchildren.
  2. Mock-based testing: Process groups and Windows Job Objects require real OS interaction. The bugs being fixed are OS-level issues that mocks can't reproduce.
  3. Event/signal-based coordination: Would require modifying child processes, but the goal is to test termination of arbitrary processes (like npx).
  4. OS monitoring APIs: Platform-specific and still require polling. The file-based approach is simpler and more portable.

The Approach Is Well-Designed

  • Black-box testing: Verifies actual behavior (processes stop working) rather than just process state
  • Cross-platform: Works identically on Windows and Unix without platform-specific test code
  • Real-world scenario: Tests continuous I/O patterns common in development servers
  • Minimal timing: Uses 0.1s intervals and reasonable timeouts
  • Comprehensive: Tests single children, nested trees, and race conditions

Critical Context

These tests provide regression coverage for production bugs where child processes weren't properly terminated. The ability to reliably test this behavior outweighs the theoretical concerns
about time-based tests.

Recommendations

  1. Consider adding a comment explaining why the file-based approach is used:
# We use file writing as a cross-platform way to verify process liveness
# without needing to know child PIDs or use platform-specific APIs
  1. The ResourceWarning filter is appropriately documented and justified.
  2. Future enhancement: Consider adding a process count assertion on Unix using ps to supplement the file-based checks.

Verdict

This PR provides a robust, well-tested solution to critical process management bugs. The testing approach, while using timing windows, is actually condition-based and represents the most
pragmatic way to verify process tree termination across platforms. The implementation quality is high with proper platform abstractions and error handling.

import pywintypes
import win32api
import win32con
import win32job
Copy link
Contributor

@jingx8885 jingx8885 Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixweinberger I saw you used win32api instead of psutil, and used win32job.TerminateJobObject in terminate_windows_process_tree function to kill all processes, which is more simple and elegance, I tested in local and it‘s worked, great job!

Copy link
Contributor Author

@felixweinberger felixweinberger Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for testing it, really appreciate you taking the time!

@felixweinberger felixweinberger force-pushed the fweinberger/child-process-termination branch 2 times, most recently from c7221f3 to 046bed4 Compare July 8, 2025 11:25
@felixweinberger felixweinberger requested a review from ihrpr July 8, 2025 11:28
ihrpr
ihrpr previously approved these changes Jul 8, 2025
Copy link
Contributor

@ihrpr ihrpr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

dsp-ant
dsp-ant previously approved these changes Jul 8, 2025
Copy link
Member

@dsp-ant dsp-ant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Thank you for working through this. My main comment here is about just improving the way we organize platform code. I would consider doing that before pushing the PR.

felixweinberger and others added 3 commits July 8, 2025 14:59
When terminating MCP servers, child processes were being orphaned
because only the parent process was killed. This caused resource
leaks and prevented proper cleanup, especially with tools like npx
that spawn child processes for the actual server implementation.

This was happening on both POSIX and Windows systems - however
because of implementation details, resolving this is non-trivial
and requires introducing psutil to introduce cross-platform
utilities for dealing with children and process trees.

This addresses critical issues where MCP servers using process
spawning tools would leave zombie processes running after client
shutdown.

resolves #850
resolves #729

Co-authored-by: jingx8885 <[email protected]>
Co-authored-by: Surya Prakash Susarla <[email protected]>
Replace manual deadline checking with anyio.move_on_after context manager
for cleaner and more idiomatic timeout handling in Unix process group
termination.
Per @dsp-ant's review feedback, reorganize platform-specific utilities:
- Create src/mcp/os/ with win32/ and posix/ subdirectories
- Move all Windows-specific code to mcp.os.win32.utilities
- Extract POSIX process termination to mcp.os.posix.utilities
- Update logger to use __name__ instead of hardcoded string
- Rename timeout parameter to timeout_seconds for clarity
- Document that missing PID means no process to terminate

This structure allows platform utilities to be shared across client,
server, and other components as needed.
@felixweinberger felixweinberger force-pushed the fweinberger/child-process-termination branch from 8de813b to 1a9b776 Compare July 8, 2025 13:59
@felixweinberger felixweinberger removed the request for review from Kludex July 8, 2025 13:59
@felixweinberger felixweinberger merged commit e033621 into main Jul 8, 2025
13 checks passed
@felixweinberger felixweinberger deleted the fweinberger/child-process-termination branch July 8, 2025 14:04
@surya-prakash-susarla
Copy link
Contributor

Thank you for pushing the fix!

@sungigant
Copy link

saqadri pushed a commit to saqadri/stdio-fixes that referenced this pull request Aug 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants